1 Themes used for this doc

2 Basics

2.3 Containers

##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
##  num [1:7] 1 4 7 10 13 16 19
## [1]    1    2    3    4  100   58 5568
##  num [1:7] 1 2 3 4 100 ...
## [1] "hey!"
## [1] "jon"   "Peter" "Sam"
##  chr [1:3] "jon" "Peter" "Sam"

2.4 Factors

##  chr [1:4] "Dec" "May" "Apr" "Dec"
##  Factor w/ 3 levels "Apr","Dec","May": 2 3 1 2
## birth_month
## Apr Dec May 
##   1   2   1
##  Factor w/ 12 levels "Jan","Feb","Mar",..: 12 5 4 12
## [1] Dec May Apr
## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## birth_month
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec 
##   0   0   0   1   1   0   0   0   0   0   0   2

3 dplyr (Data Manipulation)

View help with vignette("dplyr") and vignette("two-table"), or check out the online docs. dplyr is part of the tidyverse.

3.1 Filter

Use | as a logical OR to combine conditions inside filter().
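The chunk that produced the output below is not preserved, so here is a minimal sketch of the same idea. It assumes the built-in mtcars data as a stand-in for the movies data, and the two conditions are purely illustrative.

```r
library(dplyr)

# mtcars is a stand-in data set; the conditions are illustrative only.
# Keep rows where EITHER condition holds: 4 cylinders OR 5 forward gears.
four_cyl_or_five_gear <- mtcars %>%
  filter(cyl == 4 | gear == 5)

# Rows and columns that survived the filter
dim(four_cyl_or_five_gear)
```

Separating conditions with a comma inside filter() combines them with AND instead, so | is only needed for the OR case.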

## 
##            Aboriginal     Arabic    Aramaic    Bosnian  Cantonese 
##         12          2          5          1          1         11 
##    Chinese      Czech     Danish       Dari      Dutch   Dzongkha 
##          3          1          5          2          4          1 
##    English   Filipino     French     German      Greek     Hebrew 
##       4704          1         73         19          1          5 
##      Hindi  Hungarian  Icelandic Indonesian    Italian   Japanese 
##         28          1          2          2         11         18 
##    Kannada     Kazakh     Korean   Mandarin       Maya  Mongolian 
##          1          1          8         26          1          1 
##       None  Norwegian    Panjabi    Persian     Polish Portuguese 
##          2          4          1          4          4          8 
##   Romanian    Russian  Slovenian    Spanish    Swahili    Swedish 
##          2         11          1         40          1          5 
##      Tamil     Telugu       Thai       Urdu Vietnamese       Zulu 
##          1          1          3          1          1          2
## [1] 308  28

3.3 slice

slice() shows only certain rows, chosen by position.
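A short sketch of picking rows by position, again assuming mtcars as a stand-in data set:

```r
library(dplyr)

# First three rows
first_three <- slice(mtcars, 1:3)

# Every tenth row; n() is the number of rows in the current data
every_tenth <- slice(mtcars, seq(1, n(), by = 10))

nrow(first_three)
nrow(every_tenth)
```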

3.4 select

select() is used to pick out certain variables (columns).
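The output below has the shape of a four-column selection followed by a reordering of all columns. A sketch of that pattern, assuming mtcars and its column names as a stand-in for the movies data:

```r
library(dplyr)

# Keep only four variables
cars_small <- select(mtcars, mpg, cyl, hp, wt)
dim(cars_small)
names(cars_small)

# Move the same four variables to the front and keep everything else
cars_reordered <- select(mtcars, mpg, cyl, hp, wt, everything())
names(cars_reordered)
```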

## [1] 5043    4
## [1] "movie_title"   "director_name" "gross"         "budget"
##  [1] "movie_title"               "director_name"            
##  [3] "gross"                     "budget"                   
##  [5] "color"                     "num_critic_for_reviews"   
##  [7] "duration"                  "director_facebook_likes"  
##  [9] "actor_3_facebook_likes"    "actor_2_name"             
## [11] "actor_1_facebook_likes"    "genres"                   
## [13] "actor_1_name"              "num_voted_users"          
## [15] "cast_total_facebook_likes" "actor_3_name"             
## [17] "facenumber_in_poster"      "plot_keywords"            
## [19] "movie_imdb_link"           "num_user_for_reviews"     
## [21] "language"                  "country"                  
## [23] "content_rating"            "title_year"               
## [25] "actor_2_facebook_likes"    "imdb_score"               
## [27] "aspect_ratio"              "movie_facebook_likes"

3.5 Select Helpers

  • starts_with(): Starts with a prefix.
  • ends_with(): Ends with a suffix.
  • contains(): Contains a literal string.
  • matches(): Matches a regular expression.
  • num_range(): Matches a numerical range like x01, x02, x03.
  • one_of(): Matches variable names in a character vector.
  • everything(): Matches all variables.
  • last_col(): Select last variable, possibly with an offset.
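Several of the helpers listed above can be tried on the built-in iris data (an assumption made here only to keep the example self-contained):

```r
library(dplyr)

sepal_cols  <- names(select(iris, starts_with("Sepal")))
width_cols  <- names(select(iris, ends_with("Width")))
petal_cols  <- names(select(iris, contains("etal")))
length_cols <- names(select(iris, matches("^(Sepal|Petal)\\.Length$")))
last_one    <- names(select(iris, last_col()))

# everything() is handy for moving a column to the front
species_first <- names(select(iris, Species, everything()))

sepal_cols
width_cols
species_first
```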

3.6 rename

##  [1] "movie_title"               "director"                 
##  [3] "gross"                     "budget"                   
##  [5] "color"                     "num_critic_for_reviews"   
##  [7] "duration"                  "director_facebook_likes"  
##  [9] "actor_3_facebook_likes"    "actor_2_name"             
## [11] "actor_1_facebook_likes"    "genres"                   
## [13] "actor_1_name"              "num_voted_users"          
## [15] "cast_total_facebook_likes" "actor_3_name"             
## [17] "facenumber_in_poster"      "plot_keywords"            
## [19] "movie_imdb_link"           "num_user_for_reviews"     
## [21] "language"                  "country"                  
## [23] "content_rating"            "title_year"               
## [25] "actor_2_facebook_likes"    "imdb_score"               
## [27] "aspect_ratio"              "movie_facebook_likes"

3.8 group_by

3.9 Summarize

3.10 Pulling it together

##  [1]  0.2909999  0.2755208  2.0667876 12.3284782  0.4697230  0.2315017
##  [7]  1.6732844  0.2963527  7.1413626  1.1121905         NA         NA
## [13] 40.1424230 11.4088478  0.4763281  1.0834822  0.6181190  1.4418917
## [19]  0.4705548

4 magrittr (pipe operator)

View help with vignette("magrittr"), or check out the online docs. magrittr is part of the tidyverse.

We start with a value, here mtcars (a data.frame). Based on this, we first extract a subset, then we aggregate the information based on the number of cylinders, and then we transform the dataset by adding a variable for kilometers per liter as supplement to miles per gallon. Finally we print the result before assigning it. Note how the code is arranged in the logical order of how you think about the task: data->transform->aggregate, which is also the same order as the code will execute. It’s like a recipe – easy to read, easy to follow!

library(magrittr)
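The pipeline itself is missing from this section. The version below is a reconstruction based on the example in the magrittr vignette, which matches the description above and the printed columns (including kpl); treat it as a sketch rather than the exact chunk that was run.

```r
library(magrittr)

car_data <- mtcars %>%
  subset(hp > 100) %>%                                          # extract a subset
  aggregate(. ~ cyl, data = ., FUN = . %>% mean %>% round(2)) %>% # mean per cylinder count
  transform(kpl = mpg %>% multiply_by(0.4251)) %>%              # add kilometers per liter
  print                                                         # print, then assign
```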

##   cyl   mpg   disp     hp drat   wt  qsec   vs   am gear carb       kpl
## 1   4 25.90 108.05 111.00 3.94 2.15 17.75 1.00 1.00 4.50 2.00 11.010090
## 2   6 19.74 183.31 122.29 3.59 3.12 17.98 0.57 0.43 3.86 3.43  8.391474
## 3   8 15.10 353.10 209.21 3.23 4.00 16.77 0.00 0.14 3.29 3.50  6.419010

Note also how “building” a function on the fly for use in aggregate is very simple in magrittr: rather than an actual value as left-hand side in pipeline, just use the placeholder. This is also very useful in R’s *apply family of functions.

The combined example shows a few neat features of the pipe (which it is not):

  1. By default the left-hand side (LHS) will be piped in as the first argument of the function appearing on the right-hand side (RHS). This is the case in the subset and transform expressions.
  2. %>% may be used in a nested fashion, e.g. it may appear in expressions within arguments. This is used in the mpg to kpl conversion.
  3. When the LHS is needed at a position other than the first, one can use the dot, '.', as placeholder. This is used in the aggregate expression.
  4. The dot in e.g. a formula is not confused with a placeholder, which is utilized in the aggregate expression.
  5. Whenever only one argument is needed, the LHS, then one can omit the empty parentheses. This is used in the call to print (which also returns its argument). Here, LHS %>% print(), or even LHS %>% print(.) would also work.
  6. A pipeline with a dot (.) as LHS will create a unary function. This is used to define the aggregator function.

One feature, which was not utilized above is piping into anonymous functions, or lambdas. This is possible using standard function definitions, e.g.

However, magrittr also allows a short-hand notation:
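Both forms can be sketched as follows. The example is modeled on the magrittr vignette but uses mtcars directly so that it runs on its own; the variable names are illustrative.

```r
library(magrittr)

# Standard anonymous function: note the wrapping parentheses.
ends_fn <- mtcars %>%
  (function(x) {
    if (nrow(x) > 2) rbind(head(x, 1), tail(x, 1)) else x
  })

# Short-hand: a braced expression in which the dot stands for the LHS.
ends_dot <- mtcars %>%
  {
    if (nrow(.) > 2) rbind(head(., 1), tail(., 1)) else .
  }

# Both return the first and last row of the data
identical(ends_fn, ends_dot)
```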

4.1 Additional Pipe Operators

4.1.1 Tee %T>%

The “tee” operator, %T>% works like %>%, except it returns the left-hand side value, and not the result of the right-hand side operation. This is useful when a step in a pipeline is used for its side-effect (printing, plotting, logging, etc.). As an example:
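A sketch along the lines of the magrittr vignette example: the matrix is plotted for its side effect, and the same matrix (not the NULL returned by plot) is passed on to colSums. The seed is an assumption added here for repeatability.

```r
library(magrittr)

set.seed(1)  # hypothetical seed, not part of the original example

col_sums <- rnorm(200) %>%
  matrix(ncol = 2) %T>%
  plot %>%     # side effect only: draws a scatter plot of the two columns
  colSums      # receives the matrix, because %T>% returned its left-hand side

col_sums
```

Because the input is random, the two column sums change from run to run unless a seed is fixed.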

## [1]  7.392887 -3.946883

4.1.2 Exposition %$%

The “exposition” pipe operator, %$%, exposes the names within the left-hand side object to the right-hand side expression. Essentially, it is a short-hand for using the with() function (and the same left-hand side objects are accepted). This operator is handy when functions do not themselves have a data argument, as for example lm and aggregate do. Here is an example as illustration:
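The example from the magrittr vignette fits here: cor() has no data argument, so the column names are exposed with %$%. The with() call is added only to show the equivalent base R form.

```r
library(magrittr)

# Correlation between sepal length and width, for above-average sepal lengths
sepal_cor <- iris %>%
  subset(Sepal.Length > mean(Sepal.Length)) %$%
  cor(Sepal.Length, Sepal.Width)

# Equivalent base R form using with()
sepal_cor_with <- with(
  subset(iris, Sepal.Length > mean(Sepal.Length)),
  cor(Sepal.Length, Sepal.Width)
)

sepal_cor
```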

## [1] 0.3361992

4.1.3 Compound assignment %<>%

Finally, the compound assignment pipe operator %<>% can be used as the first pipe in a chain. The effect will be that the result of the pipeline is assigned to the left-hand side object, rather than returning the result as usual. It is essentially shorthand notation for expressions like foo <- foo %>% bar %>% baz, which boils down to foo %<>% bar %>% baz. Another example is iris$Sepal.Length %<>% sqrt.

The %<>% can be used whenever expr <- … makes sense, e.g.

  • x %<>% foo %>% bar
  • x[1:10] %<>% foo %>% bar
  • x$baz %<>% foo %>% bar
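The three forms above can be tried on small vectors. The object names and values below are made up for illustration; add() is magrittr's alias for +.

```r
library(magrittr)

# Whole object: x <- sqrt(x)
x <- c(1, 4, 9, 16)
x %<>% sqrt

# Subset assignment: only the first two elements are updated
x[1:2] %<>% add(10)

# Element of a data frame, followed by further pipes
dat <- data.frame(baz = c(4, 9))
dat$baz %<>% sqrt %>% rev

x
dat$baz
```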

5 Plot

5.1 ggplot

ggplot2 is a system for declaratively creating graphics, based on The Grammar of Graphics. You provide the data, tell ggplot2 how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.

It’s hard to succinctly describe how ggplot2 works because it embodies a deep philosophy of visualisation. However, in most cases you start with ggplot(), supply a dataset and aesthetic mapping (with aes()). You then add on layers (like geom_point() or geom_histogram()), scales (like scale_colour_brewer()), faceting specifications (like facet_wrap()) and coordinate systems (like coord_flip()).
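A minimal sketch of that layering, assuming the mpg data set that ships with ggplot2; the variable choices are illustrative only.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy, colour = class)) +  # data + aesthetic mapping
  geom_point() +                                             # layer
  scale_colour_brewer(palette = "Set2") +                    # scale
  facet_wrap(~ drv) +                                        # faceting
  coord_flip()                                               # coordinate system

p
```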

CHEATSHEET

5.3 Bar Plot

geom_bar makes the height of the bar proportional to the number of cases in each group (or if the weight aesthetic is supplied, the sum of the weights). If you want the heights of the bars to represent values in the data, use geom_col() instead. geom_bar() uses stat_count() by default: it counts the number of cases at each x position. geom_col() uses stat_identity(): it leaves the data as is.
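The difference can be sketched with the mpg data set from ggplot2 (an assumption; the original plot code is not shown). geom_bar() counts rows itself, while geom_col() expects the heights to already be in the data.

```r
library(ggplot2)

# geom_bar(): one row per car, ggplot2 counts the cases per class
p_bar <- ggplot(mpg, aes(x = class)) +
  geom_bar()

# geom_col(): counts computed beforehand and used as bar heights as-is
class_counts <- as.data.frame(table(class = mpg$class))
p_col <- ggplot(class_counts, aes(x = class, y = Freq)) +
  geom_col()

p_bar
p_col
```

Both plots should show the same bars; only the place where the counting happens differs.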

5.4 Jitter

The jitter geom is a convenient shortcut for geom_point(position = "jitter"). It adds a small amount of random variation to the location of each point, and is a useful way of handling overplotting caused by discreteness in smaller datasets.
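A small sketch, assuming the mpg data set: cyl takes only a few distinct values, so plain points overlap heavily, and jittering horizontally separates them.

```r
library(ggplot2)

# Overplotted: many cars share the same (cyl, hwy) position
p_points <- ggplot(mpg, aes(cyl, hwy)) +
  geom_point()

# Jittered: random horizontal noise only, so hwy values stay exact
p_jitter <- ggplot(mpg, aes(cyl, hwy)) +
  geom_jitter(width = 0.25, height = 0)

p_jitter
```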

## Warning: Removed 18388 rows containing missing values (geom_point).

5.5 Histogram

Visualise the distribution of a single continuous variable by dividing the x axis into bins and counting the number of observations in each bin. Histograms (geom_histogram()) display the counts with bars; frequency polygons (geom_freqpoly()) display the counts with lines. Frequency polygons are more suitable when you want to compare the distribution across the levels of a categorical variable.
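A sketch of both geoms on the mpg data set (an assumption); setting binwidth explicitly avoids the default of 30 bins.

```r
library(ggplot2)

# Histogram of highway mileage, 2 mpg per bin
p_hist <- ggplot(mpg, aes(hwy)) +
  geom_histogram(binwidth = 2)

# Frequency polygons to compare the distribution across drive types
p_freq <- ggplot(mpg, aes(hwy, colour = drv)) +
  geom_freqpoly(binwidth = 2)

p_hist
p_freq
```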

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

5.7 Line

In a line graph, observations are ordered by x value and connected.

The functions geom_line(), geom_step(), or geom_path() can be used.

The x value (for the x axis) can be:

  • dates, for time series data
  • text
  • discrete numeric values
  • continuous numeric values

Data derived from the ToothGrowth data set are used. ToothGrowth describes the effect of vitamin C on tooth growth in guinea pigs.

  • len - Tooth Length
  • dose - Dose in mg (0.5, 1, 2)


Observations can also be connected using geom_step() or geom_path():
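A sketch of the three geoms, assuming the derived data are the mean tooth length per dose (the exact derivation used originally is not shown):

```r
library(ggplot2)

# Mean tooth length for each dose (0.5, 1, 2 mg)
tooth_means <- aggregate(len ~ dose, data = ToothGrowth, FUN = mean)

base_plot <- ggplot(tooth_means, aes(x = dose, y = len)) +
  geom_point()

p_line <- base_plot + geom_line()  # connects points in order of x
p_step <- base_plot + geom_step()  # stairstep connection
p_path <- base_plot + geom_path()  # connects points in data order

p_line
```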

5.9 Themes

library("ggthemes")

Theme Library

Base example

## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

5.9.1 ggthemes

##  [1] "theme_base"            "theme_calc"           
##  [3] "theme_clean"           "theme_economist"      
##  [5] "theme_economist_white" "theme_excel"          
##  [7] "theme_excel_new"       "theme_few"            
##  [9] "theme_fivethirtyeight" "theme_foundation"     
## [11] "theme_gdocs"           "theme_hc"             
## [13] "theme_igray"           "theme_map"            
## [15] "theme_pander"          "theme_par"            
## [17] "theme_solarized"       "theme_solarized_2"    
## [19] "theme_solid"           "theme_stata"          
## [21] "theme_tufte"           "theme_wsj"
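A sketch of how such a list can be produced and how a theme is applied, assuming the ggthemes package is installed; the base plot on iris is illustrative.

```r
library(ggplot2)
library(ggthemes)

# All exported objects in ggthemes whose names start with "theme_"
theme_names <- ls("package:ggthemes", pattern = "^theme_")

# A base plot, then the same plot with two different themes added
p_base <- ggplot(iris, aes(Sepal.Length, Sepal.Width, colour = Species)) +
  geom_point()

p_econ  <- p_base + theme_economist()
p_tufte <- p_base + theme_tufte()

p_econ
```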

6 Classification and Resampling

6.3 Downsampling and up-sampling

## Loaded ROSE 0.0-3
## 
##  No Yes 
## 320 346
## 
##   No  Yes 
## 5939 6061
## 
## Call:
## glm(formula = default ~ balance, family = "binomial", data = data_rose_down$data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5946  -0.4317   0.1413   0.5672   2.4312  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -5.3052888  0.4261671  -12.45   <2e-16 ***
## balance      0.0041077  0.0003048   13.47   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 922.26  on 665  degrees of freedom
## Residual deviance: 477.42  on 664  degrees of freedom
## AIC: 481.42
## 
## Number of Fisher Scoring iterations: 6
## 
## Call:
## glm(formula = default ~ balance, family = "binomial", data = data_rose_up$data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.3786  -0.3836   0.0632   0.4573   2.9886  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -6.657e+00  1.250e-01  -53.26   <2e-16 ***
## balance      5.065e-03  9.035e-05   56.06   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 16634  on 11999  degrees of freedom
## Residual deviance:  7639  on 11998  degrees of freedom
## AIC: 7643
## 
## Number of Fisher Scoring iterations: 6

6.4 Resampling Methods

6.8 More complicated function for bootstrapping

## 
## Attaching package: 'boot'
## The following object is masked from 'package:lattice':
## 
##     melanoma

6.9 Forward Stepwise Selection

## Subset selection object
## Call: regsubsets.formula(mpg ~ ., data = Auto_sub, nvmax = 7, method = "forward")
## 8 Variables  (and intercept)
##              Forced in Forced out
## cylinders        FALSE      FALSE
## displacement     FALSE      FALSE
## horsepower       FALSE      FALSE
## weight           FALSE      FALSE
## acceleration     FALSE      FALSE
## year             FALSE      FALSE
## origin           FALSE      FALSE
## folds            FALSE      FALSE
## 1 subsets of each size up to 7
## Selection Algorithm: forward
##          cylinders displacement horsepower weight acceleration year origin
## 1  ( 1 ) " "       " "          " "        "*"    " "          " "  " "   
## 2  ( 1 ) " "       " "          " "        "*"    " "          "*"  " "   
## 3  ( 1 ) " "       " "          " "        "*"    " "          "*"  "*"   
## 4  ( 1 ) " "       "*"          " "        "*"    " "          "*"  "*"   
## 5  ( 1 ) " "       "*"          "*"        "*"    " "          "*"  "*"   
## 6  ( 1 ) "*"       "*"          "*"        "*"    " "          "*"  "*"   
## 7  ( 1 ) "*"       "*"          "*"        "*"    "*"          "*"  "*"   
##          folds
## 1  ( 1 ) " "  
## 2  ( 1 ) " "  
## 3  ( 1 ) " "  
## 4  ( 1 ) " "  
## 5  ( 1 ) " "  
## 6  ( 1 ) " "  
## 7  ( 1 ) " "

6.10 Backward Stepwise Selection

## Subset selection object
## Call: regsubsets.formula(mpg ~ ., data = Auto_sub, nvmax = 7, method = "backward")
## 8 Variables  (and intercept)
##              Forced in Forced out
## cylinders        FALSE      FALSE
## displacement     FALSE      FALSE
## horsepower       FALSE      FALSE
## weight           FALSE      FALSE
## acceleration     FALSE      FALSE
## year             FALSE      FALSE
## origin           FALSE      FALSE
## folds            FALSE      FALSE
## 1 subsets of each size up to 7
## Selection Algorithm: backward
##          cylinders displacement horsepower weight acceleration year origin
## 1  ( 1 ) " "       " "          " "        "*"    " "          " "  " "   
## 2  ( 1 ) " "       " "          " "        "*"    " "          "*"  " "   
## 3  ( 1 ) " "       " "          " "        "*"    " "          "*"  "*"   
## 4  ( 1 ) " "       "*"          " "        "*"    " "          "*"  "*"   
## 5  ( 1 ) " "       "*"          "*"        "*"    " "          "*"  "*"   
## 6  ( 1 ) "*"       "*"          "*"        "*"    " "          "*"  "*"   
## 7  ( 1 ) "*"       "*"          "*"        "*"    "*"          "*"  "*"   
##          folds
## 1  ( 1 ) " "  
## 2  ( 1 ) " "  
## 3  ( 1 ) " "  
## 4  ( 1 ) " "  
## 5  ( 1 ) " "  
## 6  ( 1 ) " "  
## 7  ( 1 ) " "

7 Classification and Regression

You can use str() to get information about what is contained in a model, e.g. str(mod1).

Setup Test/Train

Here 0.75 is the proportion (75%) of the data to put in the training set.
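The splitting code is not preserved, so this is a base R sketch of a 75/25 split. The data set (mtcars), the seed, and the object names are assumptions; the original may well have used a helper from another package.

```r
set.seed(42)  # hypothetical seed so the split is repeatable

n <- nrow(mtcars)

# Randomly pick 75% of the row indices for training
train_idx <- sample(seq_len(n), size = floor(0.75 * n))

cars_train <- mtcars[train_idx, ]   # 75% of rows
cars_test  <- mtcars[-train_idx, ]  # remaining 25%

c(train = nrow(cars_train), test = nrow(cars_test))
```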

7.1 Linear Regression

Linear regression is a linear approach to modeling the relationship between a scalar response (or dependent variable) and one or more explanatory variables (or independent variables). The case of one explanatory variable is called simple linear regression.

Generate a linear model with lm(). The formula is written as the dependent variable, followed by ~, then the independent variables separated by +. Use . to include all other variables, or exclude one with something like y ~ . - director. Get a single coefficient with, for example, mod1$coefficients[1].
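The formula variants can be sketched on mtcars, used here as a stand-in because the movies data are not included in this section:

```r
# Two named predictors
mod1 <- lm(mpg ~ wt + hp, data = mtcars)
summary(mod1)

# First coefficient (the intercept)
mod1$coefficients[1]

# All other columns as predictors
mod_all <- lm(mpg ~ ., data = mtcars)

# All other columns except qsec
mod_drop <- lm(mpg ~ . - qsec, data = mtcars)

# Top-level structure of the fitted model object
str(mod1, max.level = 1)
```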

## 
## Call:
## lm(formula = gross ~ budget + duration, data = movies_train)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -214888240  -23184331   -9165431   12757656  488641764 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.415e+07  4.971e+06  -2.846  0.00445 ** 
## budget       1.023e+00  2.344e-02  43.665  < 2e-16 ***
## duration     2.443e+05  4.624e+04   5.284 1.36e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 52570000 on 2927 degrees of freedom
##   (474 observations deleted due to missingness)
## Multiple R-squared:  0.4349, Adjusted R-squared:  0.4345 
## F-statistic:  1126 on 2 and 2927 DF,  p-value: < 2.2e-16

7.2 Logistic Regression

Logistic regression is a statistical model that in its basic form uses a logistic function to model a binary dependent variable, although many more complex extensions exist. In regression analysis, logistic regression (or logit regression) is estimating the parameters of a logistic model (a form of binary regression). Use the glm() function, and note the family = binomial argument.
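A self-contained sketch using mtcars, where am (0 = automatic, 1 = manual) is the binary outcome; this is a stand-in for the Default data used for the output below.

```r
# Logistic regression: probability of a manual transmission given weight
logit_mod <- glm(am ~ wt, family = binomial, data = mtcars)
summary(logit_mod)

# Coefficients on the log-odds scale, rounded for display
round(coef(logit_mod), 4)

# Fitted probabilities on the response scale
fitted_prob <- predict(logit_mod, type = "response")
head(fitted_prob)
```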

## 
## Call:
## glm(formula = factor(default) ~ balance, family = binomial, data = Default)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2697  -0.1465  -0.0589  -0.0221   3.7589  
## 
## Coefficients:
##                Estimate  Std. Error z value Pr(>|z|)    
## (Intercept) -10.6513306   0.3611574  -29.49   <2e-16 ***
## balance       0.0054989   0.0002204   24.95   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1596.5  on 9998  degrees of freedom
## AIC: 1600.5
## 
## Number of Fisher Scoring iterations: 8
## (Intercept)     balance 
##    -10.6513      0.0055

7.3 Ridge Regression

## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## Loading required package: foreach
## 
## Attaching package: 'foreach'
## The following objects are masked from 'package:purrr':
## 
##     accumulate, when
## Loaded glmnet 2.0-18
## 
## Attaching package: 'glmnetUtils'
## The following objects are masked from 'package:glmnet':
## 
##     cv.glmnet, glmnet
## 31 x 1 sparse Matrix of class "dgCMatrix"
##                                               1
## (Intercept)                   375.8728467164299
## color                           3.3709229184702
## color Black and White          -1.4710509793354
## colorColor                      1.5332871913922
## num_critic_for_reviews          0.0000661387495
## duration                       -0.0560651087860
## director_facebook_likes        -0.0001810292387
## actor_3_facebook_likes          0.0002041101931
## actor_1_facebook_likes         -0.0000430491628
## gross                           0.0000008036590
## num_voted_users                 0.0000290618332
## cast_total_facebook_likes       0.0000061427897
## facenumber_in_poster            0.0473720680504
## num_user_for_reviews            0.0006816099097
## content_ratingPG                1.1396240741168
## content_ratingPG-13             0.2819483822036
## content_ratingR                -0.9505434689214
## content_ratingOther             0.5102102078375
## budget                         -0.0000007517667
## title_year                     -0.1874436349418
## actor_2_facebook_likes         -0.0000069751333
## imdb_score                      0.7207706122038
## aspect_ratio                   -1.4278392481220
## movie_facebook_likes            0.0000078108292
## genre_mainAction               -2.6892740660307
## genre_mainAdventure            -1.3700422837322
## genre_mainComedy                2.0182528348528
## genre_mainCrime                -1.0237554171846
## genre_mainDrama                 0.5432805069308
## genre_mainOther                 2.0246784070736
## cast_total_facebook_likes000s   0.0147232247410
## Registered S3 method overwritten by 'xts':
##   method     from
##   as.zoo.xts zoo

7.4 Lasso Regression

## 29 x 1 sparse Matrix of class "dgCMatrix"
##                                              1
## (Intercept)                   1817.52597097060
## color                            6.21718603947
## color Black and White          -16.32634322052
## colorColor                       .            
## num_critic_for_reviews           0.02200497574
## duration                        -0.23340304359
## director_facebook_likes         -0.00116426095
## actor_3_facebook_likes          -0.00888584852
## actor_1_facebook_likes          -0.00744714247
## num_voted_users                  0.00018764443
## cast_total_facebook_likes        0.00730527750
## facenumber_in_poster             0.01932334558
## num_user_for_reviews            -0.00141989884
## content_ratingPG                 6.78067685064
## content_ratingPG-13              .            
## content_ratingR                -10.13906160797
## content_ratingOther              0.31165685136
## title_year                      -0.89939854327
## actor_2_facebook_likes          -0.00725731678
## imdb_score                       2.28958064135
## aspect_ratio                    -5.96032233036
## movie_facebook_likes            -0.00003218491
## genre_mainAction               -13.02280025232
## genre_mainAdventure             -5.60996877473
## genre_mainComedy                 5.46476603293
## genre_mainCrime                -10.53303103402
## genre_mainDrama                  .            
## genre_mainOther                  5.92525059365
## cast_total_facebook_likes000s    0.01839641166
## 29 x 1 sparse Matrix of class "dgCMatrix"
##                                            1
## (Intercept)                   719.9698526732
## color                           .           
## color Black and White           .           
## colorColor                      .           
## num_critic_for_reviews          .           
## duration                       -0.0043326804
## director_facebook_likes         .           
## actor_3_facebook_likes          .           
## actor_1_facebook_likes          .           
## num_voted_users                 0.0001601412
## cast_total_facebook_likes       .           
## facenumber_in_poster            .           
## num_user_for_reviews            .           
## content_ratingPG                0.9868246546
## content_ratingPG-13             .           
## content_ratingR                -5.8747524831
## content_ratingOther             .           
## title_year                     -0.3569338609
## actor_2_facebook_likes          .           
## imdb_score                      .           
## aspect_ratio                   -2.7581949479
## movie_facebook_likes            .           
## genre_mainAction               -4.1613186904
## genre_mainAdventure             .           
## genre_mainComedy                3.2939157095
## genre_mainCrime                 .           
## genre_mainDrama                 .           
## genre_mainOther                 .           
## cast_total_facebook_likes000s   .

7.5 ElasticNet

## [1] 0.00 0.25 0.50 0.75 1.00
## Call:
## cva.glmnet.formula(formula = profitM ~ ., data = movies_train, 
##     alpha = alpha_list)
## 
## Model fitting options:
##     Sparse model matrix: FALSE
##     Use model.frame: FALSE
##     Alpha values: 0 0.25 0.5 0.75 1
##     Number of crossvalidation folds for lambda: 10

## 29 x 1 sparse Matrix of class "dgCMatrix"
##                                     1
## (Intercept)                   507.144
## color                           .    
## color Black and White           .    
## colorColor                      .    
## num_critic_for_reviews          .    
## duration                        .    
## director_facebook_likes         .    
## actor_3_facebook_likes          .    
## actor_1_facebook_likes          .    
## num_voted_users                 0.000
## cast_total_facebook_likes       .    
## facenumber_in_poster            .    
## num_user_for_reviews            .    
## content_ratingPG                .    
## content_ratingPG-13             .    
## content_ratingR                -3.881
## content_ratingOther             .    
## title_year                     -0.252
## actor_2_facebook_likes          .    
## imdb_score                      .    
## aspect_ratio                   -1.041
## movie_facebook_likes            .    
## genre_mainAction               -1.895
## genre_mainAdventure             .    
## genre_mainComedy                1.382
## genre_mainCrime                 .    
## genre_mainDrama                 .    
## genre_mainOther                 .    
## cast_total_facebook_likes000s   .
## 29 x 1 sparse Matrix of class "dgCMatrix"
##                                              1
## (Intercept)                   1824.83456917791
## color                            6.78406927311
## color Black and White          -16.42216538689
## colorColor                       .            
## num_critic_for_reviews           0.02243286736
## duration                        -0.23490224956
## director_facebook_likes         -0.00116390727
## actor_3_facebook_likes          -0.00935228173
## actor_1_facebook_likes          -0.00777160447
## num_voted_users                  0.00018773817
## cast_total_facebook_likes        0.00531736364
## facenumber_in_poster             0.02480761575
## num_user_for_reviews            -0.00154877509
## content_ratingPG                 6.80214813322
## content_ratingPG-13              .            
## content_ratingR                -10.12517852647
## content_ratingOther              0.32191366168
## title_year                      -0.90314657015
## actor_2_facebook_likes          -0.00758867152
## imdb_score                       2.32360789565
## aspect_ratio                    -5.95505883297
## movie_facebook_likes            -0.00003347691
## genre_mainAction               -13.15425499855
## genre_mainAdventure             -5.73089418600
## genre_mainComedy                 5.34359619040
## genre_mainCrime                -10.66799259723
## genre_mainDrama                  .            
## genre_mainOther                  5.87709172623
## cast_total_facebook_likes000s    2.33040943783
## 29 x 1 sparse Matrix of class "dgCMatrix"
##                                     1
## (Intercept)                   523.253
## color                           .    
## color Black and White           .    
## colorColor                      .    
## num_critic_for_reviews          .    
## duration                        .    
## director_facebook_likes         .    
## actor_3_facebook_likes          .    
## actor_1_facebook_likes          .    
## num_voted_users                 0.000
## cast_total_facebook_likes       .    
## facenumber_in_poster            .    
## num_user_for_reviews            .    
## content_ratingPG                .    
## content_ratingPG-13             .    
## content_ratingR                -3.980
## content_ratingOther             .    
## title_year                     -0.261
## actor_2_facebook_likes          .    
## imdb_score                      .    
## aspect_ratio                   -1.063
## movie_facebook_likes            .    
## genre_mainAction               -2.099
## genre_mainAdventure             .    
## genre_mainComedy                1.550
## genre_mainCrime                 .    
## genre_mainDrama                 .    
## genre_mainOther                 .    
## cast_total_facebook_likes000s   .

7.7 Confusion Matrix

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  Yes   No
##        Yes  100   42
##        No   233 9625
##                                                
##                Accuracy : 0.9725               
##                  95% CI : (0.9691, 0.9756)     
##     No Information Rate : 0.9667               
##     P-Value [Acc > NIR] : 0.0004973            
##                                                
##                   Kappa : 0.4093               
##                                                
##  Mcnemar's Test P-Value : < 0.00000000000000022
##                                                
##             Sensitivity : 0.3003               
##             Specificity : 0.9957               
##          Pos Pred Value : 0.7042               
##          Neg Pred Value : 0.9764               
##              Prevalence : 0.0333               
##          Detection Rate : 0.0100               
##    Detection Prevalence : 0.0142               
##       Balanced Accuracy : 0.6480               
##                                                
##        'Positive' Class : Yes                  
## 

8 forcats (working with factors)

R uses factors to handle categorical variables, variables that have a fixed and known set of possible values. Factors are also helpful for reordering character vectors to improve display. The goal of the forcats package is to provide a suite of tools that solve common problems with factors, including changing the order of levels or the values. Some examples include:

  • fct_reorder(): Reordering a factor by another variable.
  • fct_infreq(): Reordering a factor by the frequency of values.
  • fct_relevel(): Changing the order of a factor by hand.
  • fct_lump(): Collapsing the least/most frequent values of a factor into “other”.
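The four functions listed above can be tried on small made-up factors. The values are illustrative and assume the forcats package is installed.

```r
library(forcats)

birth_month <- factor(c("Dec", "May", "Apr", "Dec"))

# Most frequent level first
by_freq <- fct_infreq(birth_month)

# Move "May" to the front by hand
by_hand <- fct_relevel(birth_month, "May")

# Keep the single most frequent level, lump the rest into "Other"
lumped <- fct_lump(birth_month, n = 1)

# Reorder the levels of one factor by the values of another variable
group <- factor(c("a", "b", "c"))
score <- c(3, 1, 2)
by_score <- fct_reorder(group, score)

levels(by_freq)
levels(by_hand)
levels(lumped)
levels(by_score)
```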